P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
## [1] "Frequency distribution of quality"
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## [1] "Summary of variable fixed.acidity"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] "Summary of variable volatile.acidity"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## [1] "Summary of variable citric.acid"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] "Summary of variable residual.sugar"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## [1] "Summary of variable chlorides"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## [1] "Summary of variable free.sulfur.dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## [1] "Summary of variable total.sulfur.dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## [1] "Summary of variable density"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## [1] "Summary of variable pH"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## [1] "Summary of variable sulphates"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## [1] "Summary of variable alcohol"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
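A rough way to read skewness off the summaries above is to compare each mean with its median, scaled by the interquartile range. The sketch below only re-uses the printed quartiles, medians and means; the `skew_hint` indicator is a hypothetical convenience for illustration, not a standard statistic and not something computed in the original analysis.

```r
# Quartiles, medians and means copied from the summaries printed above.
summ <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  q1     = c(7.10, 0.39, 0.09, 1.90, 0.070, 7, 22, 0.9956, 3.21, 0.55, 9.5),
  median = c(7.90, 0.52, 0.26, 2.20, 0.079, 14, 38, 0.9968, 3.31, 0.62, 10.2),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47,
             0.9967, 3.311, 0.6581, 10.42),
  q3     = c(9.20, 0.64, 0.42, 2.60, 0.090, 21, 62, 0.9978, 3.40, 0.73, 11.1)
)

# Hypothetical indicator: (mean - median) / IQR.
# Values well above 0 hint at a long right tail; values near 0 hint at symmetry.
summ$skew_hint <- round((summ$mean - summ$median) / (summ$q3 - summ$q1), 3)

# Variables ordered from most to least right-skewed by this indicator.
summ[order(-summ$skew_hint), c("variable", "skew_hint")]
```

With the real data, a histogram of each variable remains the more reliable check.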
quality is slightly left-skewed.
fixed.acidity, volatile.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide and alcohol are right-skewed, with shapes that loosely resemble a Poisson distribution.
residual.sugar, chlorides and sulphates have long tails on the positive side.
density and pH are roughly normally distributed.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The data set contains information about 1599 red variants of the Portuguese “Vinho Verde” wine. There are twelve variables for each wine.
The variable quality is the dependent variable, while the remaining eleven variables are independent variables. The dependent variable is the one we hope to understand better through this dataset.
## [1] "correlation between each independent variable and quality"
## fixed.acidity volatile.acidity citric.acid
## 0.124 0.391 0.226
## residual.sugar chlorides free.sulfur.dioxide
## 0.014 0.129 0.051
## total.sulfur.dioxide density pH
## 0.185 0.175 0.058
## sulphates alcohol
## 0.251 0.476
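No single independent variable is strongly correlated with the dependent variable; the largest coefficients are 0.476 (alcohol) and 0.391 (volatile.acidity). The coefficients appear to be printed as absolute values, so they show strength but not direction. We will probably need several variables working together to predict wine quality.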
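As a minimal sketch of how such a table of correlations might be produced, the block below uses only the first ten rows shown in the `str()` output above, so its numbers will not match the full-data values. The name `wine_head` is an assumption made for illustration.

```r
# First ten rows of three predictors and quality, copied from the str() output.
# This is a toy subset, not the full 1599-row data set.
wine_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

predictors <- setdiff(names(wine_head), "quality")

# Pearson correlation of each predictor with quality (sign kept).
signed_cor <- sapply(wine_head[predictors], cor, y = wine_head$quality)

# Absolute values show strength only; the sign shows direction.
round(abs(signed_cor), 3)
round(signed_cor, 3)
```

Keeping the signed version alongside the absolute one makes it easier to see which variables move against quality.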
I created a variable called leqFive indicating whether a wine has a quality less than or equal to five. The reason is that 47% of the wines have a quality less than or equal to 5 and 53% have a quality greater than or equal to 6. In addition, wines with a quality of 5 or 6 make up 82% of all wines. So it is very important to be able to distinguish wines with quality less than or equal to 5 from wines with quality greater than or equal to 6.
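These proportions can be checked from the frequency distribution of quality printed at the top of the report. The sketch below rebuilds the quality vector from those counts; how leqFive was actually added to the data frame is not shown in the report, so the last lines are an assumption.

```r
# Counts per quality level, copied from the frequency table above.
quality_counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Rebuild a quality vector with one entry per wine.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# TRUE when the wine has quality 5 or lower (assumed definition of leqFive).
leqFive <- quality <= 5

# Percentage of wines on each side of the boundary.
round(100 * prop.table(table(leqFive)), 1)

# Percentage of wines with quality 5 or 6.
share_5_or_6 <- round(100 * mean(quality %in% c(5, 6)), 1)
share_5_or_6
```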
I think the dependent variable is better treated as a categorical variable, so I am converting it to a factor in R.
After this transformation, predicting wine quality becomes a classification problem.
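The per-quality summaries and one-way ANOVA tables that follow could be produced along these lines. This is a sketch on the first ten rows from the `str()` output only, so the toy subset has just three quality levels and its statistics will differ from the full-data output; `wine_head` and `fit` are names assumed for illustration.

```r
# First ten alcohol and quality values, copied from the str() output.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Treat quality as a categorical variable.
wine_head$quality <- factor(wine_head$quality)

# Summary of alcohol within each quality level.
by(wine_head$alcohol, wine_head$quality, summary)

# One-way ANOVA: does mean alcohol differ across quality levels?
fit <- aov(alcohol ~ quality, data = wine_head)
summary(fit)
```

Since quality is ordinal, `factor(..., ordered = TRUE)` would be another reasonable choice, depending on the model used later.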
## [1] "Summary of variable fixed.acidity By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.700 7.150 7.500 8.360 9.875 11.600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.779 8.400 12.500
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.100 7.800 8.167 8.900 15.900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.000 7.900 8.347 9.400 14.300
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.800 8.872 10.100 15.600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.250 8.250 8.567 10.230 12.600
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 94 18.737 6.283 8.79e-06 ***
## Residuals 1593 4751 2.982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable volatile.acidity By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 8.22 1.645 60.91 <2e-16 ***
## Residuals 1593 43.01 0.027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable citric.acid By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 3.53 0.7059 19.69 <2e-16 ***
## Residuals 1593 57.11 0.0359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable residual.sugar By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.875 2.100 2.635 3.100 5.700
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.900 2.100 2.694 2.800 12.900
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.529 2.600 15.500
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.477 2.500 15.400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.721 2.750 8.900
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.800 2.100 2.578 2.600 6.400
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 10 2.094 1.053 0.385
## Residuals 1593 3166 1.988
## [1] "Summary of variable chlorides By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0610 0.0790 0.0905 0.1225 0.1430 0.2670
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.066 0.013162 6.036 1.53e-05 ***
## Residuals 1593 3.474 0.002181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable free.sulfur.dioxide By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 6.0 11.0 14.5 34.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 12.26 15.00 41.00
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 15.00 16.98 23.00 68.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 15.71 21.00 72.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 14.05 18.00 54.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 7.50 13.28 16.50 42.00
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 2571 514.1 4.754 0.000257 ***
## Residuals 1593 172274 108.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable total.sulfur.dioxide By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 12.5 15.0 24.9 42.5 49.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 14.00 26.00 36.25 49.00 119.00
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 26.00 47.00 56.51 84.00 155.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 23.00 35.00 40.87 54.00 165.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.50 27.00 35.02 43.00 289.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 16.00 21.50 33.44 43.00 88.00
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 128045 25609 25.48 <2e-16 ***
## Residuals 1593 1601155 1005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable density By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9947 0.9962 0.9976 0.9975 0.9988 1.0010
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9956 0.9965 0.9965 0.9974 1.0010
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9926 0.9962 0.9970 0.9971 0.9979 1.0030
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9954 0.9966 0.9966 0.9979 1.0040
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9948 0.9958 0.9961 0.9974 1.0030
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.000230 4.594e-05 13.4 8.12e-13 ***
## Residuals 1593 0.005462 3.430e-06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable pH By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.160 3.312 3.390 3.398 3.495 3.630
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.370 3.382 3.500 3.900
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.300 3.305 3.400 3.740
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.220 3.320 3.318 3.410 4.010
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.920 3.200 3.280 3.291 3.380 3.780
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.162 3.230 3.267 3.350 3.720
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.51 0.10242 4.342 0.000628 ***
## Residuals 1593 37.58 0.02359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable sulphates By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 3.00 0.6000 22.27 <2e-16 ***
## Residuals 1593 42.91 0.0269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable alcohol By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.580 11.000
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 483.9 96.79 115.9 <2e-16 ***
## Residuals 1593 1330.8 0.84
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Correlation matrix for independent variables"
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.000 0.256 0.672
## volatile.acidity 0.256 1.000 0.552
## citric.acid 0.672 0.552 1.000
## residual.sugar 0.115 0.002 0.144
## chlorides 0.094 0.061 0.204
## free.sulfur.dioxide 0.154 0.011 0.061
## total.sulfur.dioxide 0.113 0.076 0.036
## density 0.668 0.022 0.365
## pH 0.683 0.235 0.542
## sulphates 0.183 0.261 0.313
## alcohol 0.062 0.202 0.110
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.115 0.094 0.154
## volatile.acidity 0.002 0.061 0.011
## citric.acid 0.144 0.204 0.061
## residual.sugar 1.000 0.056 0.187
## chlorides 0.056 1.000 0.006
## free.sulfur.dioxide 0.187 0.006 1.000
## total.sulfur.dioxide 0.203 0.047 0.668
## density 0.355 0.201 0.022
## pH 0.086 0.265 0.070
## sulphates 0.006 0.371 0.052
## alcohol 0.042 0.221 0.069
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity 0.113 0.668 0.683 0.183 0.062
## volatile.acidity 0.076 0.022 0.235 0.261 0.202
## citric.acid 0.036 0.365 0.542 0.313 0.110
## residual.sugar 0.203 0.355 0.086 0.006 0.042
## chlorides 0.047 0.201 0.265 0.371 0.221
## free.sulfur.dioxide 0.668 0.022 0.070 0.052 0.069
## total.sulfur.dioxide 1.000 0.071 0.066 0.043 0.206
## density 0.071 1.000 0.342 0.149 0.496
## pH 0.066 0.342 1.000 0.197 0.206
## sulphates 0.043 0.149 0.197 1.000 0.094
## alcohol 0.206 0.496 0.206 0.094 1.000
volatile.acidity, density and pH tend to decrease as wine quality increases.
citric.acid, sulphates and alcohol tend to increase as wine quality increases.
The medians of fixed.acidity, residual.sugar and chlorides do not seem to vary much with quality, although the ANOVA tests for fixed.acidity and chlorides are significant.
free.sulfur.dioxide and total.sulfur.dioxide seem to be lower in low-quality and high-quality wines and higher in middle-quality wines.
The correlations between free.sulfur.dioxide and total.sulfur.dioxide, as well as between fixed.acidity and citric.acid, are higher than 0.6. We might consider using only one of two correlated variables when building the model.
In summary, volatile.acidity, density, pH, citric.acid, sulphates and alcohol tend to change as wine quality increases.
free.sulfur.dioxide and total.sulfur.dioxide, as well as fixed.acidity and citric.acid, are moderately correlated, so we might need to exclude correlated variables when building the model.
The correlation coefficient between free.sulfur.dioxide and total.sulfur.dioxide is 0.668. The highest value among all pairs of independent variables in the matrix above is 0.683, between fixed.acidity and pH.
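Going by the printed correlation matrix, the strongly correlated pairs can be listed programmatically instead of read off by eye. The sketch below copies a six-variable block of that matrix (values as printed, which appear to be absolute values); the 0.6 threshold follows the text, and the names `abs_cor` and `high_pairs` are assumptions for illustration.

```r
vars <- c("fixed.acidity", "citric.acid", "free.sulfur.dioxide",
          "total.sulfur.dioxide", "density", "pH")

# Sub-block of the correlation matrix printed above.
abs_cor <- matrix(c(
  1.000, 0.672, 0.154, 0.113, 0.668, 0.683,
  0.672, 1.000, 0.061, 0.036, 0.365, 0.542,
  0.154, 0.061, 1.000, 0.668, 0.022, 0.070,
  0.113, 0.036, 0.668, 1.000, 0.071, 0.066,
  0.668, 0.365, 0.022, 0.071, 1.000, 0.342,
  0.683, 0.542, 0.070, 0.066, 0.342, 1.000
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Look only above the diagonal so each pair is reported once.
idx <- which(upper.tri(abs_cor) & abs_cor > 0.6, arr.ind = TRUE)

high_pairs <- data.frame(
  var1  = rownames(abs_cor)[idx[, "row"]],
  var2  = colnames(abs_cor)[idx[, "col"]],
  abs_r = abs_cor[idx]
)

# Pairs above the threshold, strongest first.
high_pairs[order(-high_pairs$abs_r), ]
```

On the full data, the same steps would start from `abs(cor(...))` over the eleven independent variables.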
In these plots, we are looking for combinations of independent variables that support a clear separating line between wines with quality less than or equal to five and wines with quality higher than five.
Judging from the plots, the combinations of sulphates and alcohol, chlorides and alcohol, volatile.acidity and alcohol, and volatile.acidity and sulphates seem able to help us distinguish higher-quality wines (\(\geq 6\)) from lower-quality wines (\(\leq 5\)).
Even though free.sulfur.dioxide and total.sulfur.dioxide are moderately correlated with each other, the plots show that many low-quality (\(\leq 5\)) wines tend to have a higher total.sulfur.dioxide for a given value of free.sulfur.dioxide. So the combination of the two variables seems to provide some explanation of wine quality.
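A plot of this kind, together with a simple numeric check of the sulfur dioxide observation, might look like the sketch below. It uses base graphics and only the first ten rows from the `str()` output, so it illustrates the approach rather than reproducing the original figures; the colours and the ratio comparison are assumptions added for illustration.

```r
# First ten rows of the two sulfur dioxide variables and quality (toy subset).
wine_head <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  quality              = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine_head$leqFive <- wine_head$quality <= 5

# Scatter plot of the pair, coloured by the leqFive indicator.
plot(wine_head$free.sulfur.dioxide, wine_head$total.sulfur.dioxide,
     col = ifelse(wine_head$leqFive, "firebrick", "steelblue"), pch = 19,
     xlab = "free.sulfur.dioxide", ylab = "total.sulfur.dioxide")
legend("topleft", legend = c("quality <= 5", "quality >= 6"),
       col = c("firebrick", "steelblue"), pch = 19)

# Mean ratio of total to free sulfur dioxide within each group.
ratio_by_group <- tapply(
  wine_head$total.sulfur.dioxide / wine_head$free.sulfur.dioxide,
  wine_head$leqFive, mean
)
ratio_by_group
```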
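library(randomForest)  # provides rfcv(), tuneRF() and randomForest()
set.seed(0306)
# find number of variables to use (5-fold cross-validation);
# columns 12 and 13 (quality and, presumably, leqFive) are dropped from the predictors
wine.rfcv <- rfcv(trainx = wine[, -c(12, 13)],
                  trainy = wine$quality,
                  cv.fold = 5)
plot(wine.rfcv$n.var, wine.rfcv$error.cv, pch = 19, type = "b")
# find parameter value `mtry`
wine.tunedRF <- tuneRF(x = wine[, -c(12, 13)],
                       y = wine$quality)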
## mtry = 3 OOB error = 29.52%
## Searching left ...
## mtry = 2 OOB error = 29.83%
## -0.01059322 0.05
## Searching right ...
## mtry = 6 OOB error = 30.46%
## -0.03177966 0.05
# fit randomForest model with the tuned mtry
set.seed(1126)
wine.rf <- randomForest(x = wine[, -c(12, 13)],
                        y = wine$quality,
                        ntree = 1500,
                        mtry = 3,
                        importance = TRUE)
# see the importance of variables
importance(wine.rf)
## 3 4 5 6 7
## fixed.acidity -0.56111865 -0.3486899 48.42989 42.03243 46.28220
## volatile.acidity 8.06171651 14.6071562 66.20499 45.71393 74.95099
## citric.acid 3.89390048 3.3848477 41.64058 37.99165 46.74801
## residual.sugar 2.06429993 -2.9363878 47.47231 48.15396 42.69723
## chlorides -1.68556675 -3.9360030 57.00153 44.74722 40.27894
## free.sulfur.dioxide 0.06058017 2.2928119 46.74999 44.94971 37.86081
## total.sulfur.dioxide 3.32982546 1.9831489 69.41414 62.44532 56.57785
## density -4.09555243 -6.7015283 53.32682 57.13734 52.91719
## pH -0.93451508 6.6218486 48.66793 39.11223 39.63130
## sulphates 1.66961907 12.3961049 75.45682 67.87864 81.22823
## alcohol -0.15992995 5.4472305 105.11547 68.96970 92.39223
## 8 MeanDecreaseAccuracy MeanDecreaseGini
## fixed.acidity 6.569794 70.87984 77.95775
## volatile.acidity 12.661250 90.69189 107.14393
## citric.acid 11.550874 66.58893 75.73842
## residual.sugar 10.426693 75.53234 73.23089
## chlorides 9.189612 72.69759 83.28575
## free.sulfur.dioxide 9.015734 70.41771 68.46716
## total.sulfur.dioxide 12.624055 96.21514 106.75614
## density 9.432491 83.12110 94.75334
## pH 9.925760 70.75903 76.81289
## sulphates 18.908448 109.32273 113.46508
## alcohol 19.247203 130.01911 149.44436
# confusion matrix: out-of-bag predictions (rows) vs. actual quality (columns)
table(wine.rf$predicted, wine$quality)
##
## 3 4 5 6 7 8
## 3 0 1 0 0 0 0
## 4 1 0 1 1 0 0
## 5 8 36 562 121 10 0
## 6 1 15 112 484 76 9
## 7 0 1 6 32 112 7
## 8 0 0 0 0 1 2
# prediction accuracy (%)
100*round(sum(diag(table(wine.rf$predicted, wine$quality)))/nrow(wine),4)
## [1] 72.55
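I built a random forest model using all the independent variables in the original dataset. The model gives a prediction accuracy of 72.55%, which is not very high. Note that wine.rf$predicted holds out-of-bag predictions, so this figure is an out-of-bag estimate rather than a true in-sample accuracy.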
The boundary between quality \(\leq 5\) and quality \(\geq 6\) divides the data set into two halves of roughly equal size. 82% of the wines are of quality 5 or 6.
The median values of sulphates, alcohol and citric.acid tend to increase as the quality of the wine gets higher.
Even though each single independent variable has only a weak correlation with wine quality, combinations of two variables can support a separating line that classifies wines with quality lower than or equal to 5 against wines with quality higher than 5.
The purpose of this data exploration is to identify the variables to use in a model that predicts wine quality. We find that no single variable indicates wine quality well enough; combinations of variables give a better idea of it. I used a random forest model to perform feature selection, and the results suggest that we need to use all the variables at hand. Based on the prediction results, most classification errors involve quality 5 and quality 6. We might need to dig deeper in that direction.
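As a starting point for that investigation, the confusion matrix printed above can be broken down further. The sketch below re-enters the printed counts by hand and computes the overall accuracy, the share of errors that are 5/6 confusions, and the accuracy after collapsing quality to the two groups \(\leq 5\) and \(\geq 6\). The collapsed accuracy is an illustrative extra and was not part of the original analysis.

```r
classes <- as.character(3:8)

# Confusion matrix copied from the output above:
# rows are predicted quality, columns are actual quality.
conf <- matrix(c(
  0,  1,   0,   0,   0, 0,
  1,  0,   1,   1,   0, 0,
  8, 36, 562, 121,  10, 0,
  1, 15, 112, 484,  76, 9,
  0,  1,   6,  32, 112, 7,
  0,  0,   0,   0,   1, 2
), nrow = 6, byrow = TRUE,
dimnames = list(predicted = classes, actual = classes))

# Overall accuracy in percent (same formula as in the report).
accuracy <- 100 * round(sum(diag(conf)) / sum(conf), 4)

# Share of all misclassifications that confuse quality 5 with quality 6.
errors     <- sum(conf) - sum(diag(conf))
swap_5_6   <- conf["5", "6"] + conf["6", "5"]
share_swap <- round(100 * swap_5_6 / errors, 1)

# Accuracy when only the side of the 5/6 boundary has to be right.
low  <- c("3", "4", "5")
high <- c("6", "7", "8")
binary_accuracy <- round(
  100 * (sum(conf[low, low]) + sum(conf[high, high])) / sum(conf), 1
)

c(accuracy = accuracy, share_swap = share_swap, binary_accuracy = binary_accuracy)
```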